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Introduction 


The purpose of this talk is to discuss ways to get better
performance out of the Connection Machine. I assume a
basic understanding of PARIS and the Connection
Machine architecture. Most of the ideas presented in this
talk are language independent, although I will present
some of them in a specific language. Some of the ideas
might necessitate programming at or below the PARIS level.
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Introduction (cont.)


Many of the ideas contained in this talk may apply only to the CM2 
and not to future TMC products. 


This is a living presentation. If any of you have performance
enhancing ideas that I do not cover in this talk, please send
them to performance-talk@think.com. Thinking Machines is free
to do anything with messages sent to that address.
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Speeds 


• All numbers scaled to 65,536 processors
• Peak
  • 32 bit Weitek
      2 ops/cycle * 8 Meg cycles/sec * (65,536 / 32) procs
      32.768 GFLOP
      2 ops/cycle * 7 Meg cycles/sec * (65,536 / 32) procs
      28.672 GFLOP
  • 64 bit Weitek takes two cycles to load/store 64 bit
    floating point numbers because of 32 bit bandwidth
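
The peak figures above follow directly from the three factors on the slide. A minimal sketch of that arithmetic, assuming "Meg" means 10^6 (which is what reproduces the quoted totals); the function and constant names are illustrative only:

```python
# Peak floating point rate, following the arithmetic on the slide.
PROCESSORS = 65_536        # all numbers are scaled to a full machine
PROCESSORS_PER_FPU = 32    # one Weitek chip is shared by 32 processors
OPS_PER_CYCLE = 2          # as quoted on the slide


def peak_gflops(clock_hz: float) -> float:
    """Return the peak rate in GFLOP/s for a clock frequency given in Hz."""
    fpus = PROCESSORS // PROCESSORS_PER_FPU  # 2,048 floating point units
    return OPS_PER_CYCLE * clock_hz * fpus / 1e9


if __name__ == "__main__":
    for mhz in (8, 7):
        print(f"{mhz} MHz: {peak_gflops(mhz * 1e6):.3f} GFLOP/s")
```

These are upper bounds: they ignore the load/store cycles needed to feed the Weitek chips.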
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Processor Section 


[Figure: processor section. Labels: 32 sets; RAM, 64K x 1 bits; 8 data chips; 3 ECC chips; CM chip (16 bit-serial processors and router), shown twice; router wires; floating point unit; status; sequencer address.]

Section is repeated 2,048 times in a full CM-2
Instructions are broadcast
Memory addresses can be broadcast or locally generated (indirect addressing)
The memory and floating point units are commercial parts
CM and Sprint chips are standard cell/custom CMOS parts
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Connection Machine Chip 


[Figure: Connection Machine chip. Labels: instruction, readback pins, global-or, 16+6 ECC, cube pins.]
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Sprint Chip 


[Figure: Sprint chip. Labels: memory bus (32), float bus (32), address bus, FPU status bus, offset (20), transposers T0, T1 and T2, status transposer.]

• Bit serial <-> word parallel conversion
• Floating point interface
• Indirect addressing
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Floating Point Unit 


[Figure: floating point unit attached to the Sprint chip. Labels: instruction from sequencer; register file, 32 or 64-bit x 32; multiply.]

• Standard IEEE part
• Batching 32 processors gets vector performance
• Virtual processor pipelining
• Floating point add or multiply takes one "Add time"
• Floating point divide takes 3 "Add times"
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VP Ratio 


• Definition
    PP - Number of Physical Processors
    VP - Number of Virtual Processors

    VPR = VP/PP
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Memory 


• Memory
  • VPR things per physical processor
  • Can be wasteful
    Ex:
      add_const(dest, const, length):
          temp = allocate(length)
          move_const(temp, const, length)
          add(dest, temp, length)

    VPR copies of the constant in each physical processor
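
To put a number on that waste, here is a minimal sketch of the footprint of such a temp. It assumes 64K bits of memory per physical processor (read off the "64K x 1 bits" RAM label on the processor section slide); the function names are hypothetical:

```python
# Assumption: 64K bits of RAM per physical processor ("64K x 1 bits").
MEMORY_BITS_PER_PROCESSOR = 64 * 1024


def temp_bits(vpr: int, length: int) -> int:
    """Bits per physical processor used by a temp field of `length` bits.

    Every virtual processor gets its own copy, so the cost scales with VPR
    even though all copies hold the same constant.
    """
    return vpr * length


def temp_fraction(vpr: int, length: int) -> float:
    """Fraction of a physical processor's memory taken by the temp."""
    return temp_bits(vpr, length) / MEMORY_BITS_PER_PROCESSOR


if __name__ == "__main__":
    for vpr in (1, 16, 64, 256):
        print(vpr, temp_bits(vpr, 32), f"{temp_fraction(vpr, 32):.2%}")
```

An operation that takes the constant as an immediate operand would avoid the temp entirely.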
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Performance 


• Naive
    C1 + VPR(C2 + TB)
      C1: instruction overhead
      C2: VP loop overhead
      T: time per bit
      B: number of bits
• Many important sublinear cases
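
The naive model is easy to play with numerically. The sketch below encodes the formula as written; the constants used in the example run are made up for illustration and are not measured CM-2 values:

```python
def naive_time(vpr: int, bits: int, c1: float, c2: float, t: float) -> float:
    """C1 + VPR * (C2 + T * B), exactly as on the slide."""
    return c1 + vpr * (c2 + t * bits)


def overhead_share(vpr: int, bits: int, c1: float, c2: float, t: float) -> float:
    """Fraction of the total time spent on overhead (C1 and the C2 terms)."""
    return (c1 + vpr * c2) / naive_time(vpr, bits, c1, c2, t)


if __name__ == "__main__":
    # Hypothetical constants, chosen only to show the shape of the curve.
    C1, C2, T = 100.0, 10.0, 1.0
    for bits in (1, 8, 32, 64):
        share = overhead_share(256, bits, C1, C2, T)
        print(f"B={bits:2d}  overhead share at VPR 256: {share:.0%}")
```

With short fields the per-VP overhead C2 dominates, which is what several of the later techniques try to remove.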
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Instruction Performance 


Top = Tfd + Tfp + Tcm + Tother 
Tfd = Field decoding time 
Sun4,VAX 6220 ~1us 
LispM, VAX 8250 ~8us 
Tfp = Fifo Push time 
Sun4 ~1-2us
VAX ~2-4us
LispM ~8-9us
Tcm = Instruction time on the CM
depends on VP ratio & instruction 
Tother = other front end time 
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VP Looping 


• VP looping in microcode
  • Because the CM is loosely coupled to the front end (through the FIFO),
    Tfd, Tfp and Tother can be overlapped with Tcm from previous
    instructions.
• VP looping on front end
  • Tfd, Tfp and Tother can be overlapped with Tcm from previous
    instructions or from the previous VP loop.
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• Linear
means that the formula on the CM is C1 + VPR(C2 + TB)

  • integer
  • boolean
  • software floating point
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Router 


• Speeding up the router
  • At higher VP ratios (>= 4), the constant overhead per iteration
    (C2) is higher than at lower VP ratios.
  • If possible, only route within VP banks.
    At high VP ratios (>= 4) the CM uses Sprint routing, which
    uses indirect addressing.
  • For routing at these VP ratios, try to first move your data around within
    the processor, then use a bunch of low VP ratio sends, and then move
    the data around within the processor again.
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• Sometimes jumbling the data speeds things up (if you have
  a nasty collision pattern)

• Longer messages are cheaper (per bit) than shorter
  messages
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Router Compiler 


Although the router is one of the key performance
wins in the CM, it can be nonoptimal for static routing
patterns (ones that stay constant for many calls to the
router).

Reasons for this include:

• Dynamic redirection of messages
• Inclusion of destination address in message
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Router Compiler 


We have built a routing compiler that can write over the hypercube
wires directly (using CMIS) and can often get a 2-3 times
performance improvement over SENDs.

By trying to minimize the total communication cost (number of
messages over number of wires, total Hamming distance of
all messages, etc.) by rearranging the mapping of grid points
to processors, it may be possible to get another factor of 2
over that.
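
One of the cost measures named above, total Hamming distance, can be sketched in a few lines. The example compares two mappings of a hypothetical 16-point line of nearest-neighbor messages onto hypercube addresses: the plain binary mapping and a Gray code mapping. This only illustrates the metric; it is not the routing compiler's actual algorithm.

```python
def hamming(a: int, b: int) -> int:
    """Number of hypercube dimensions in which addresses a and b differ."""
    return bin(a ^ b).count("1")


def gray(i: int) -> int:
    """Binary reflected Gray code of i."""
    return i ^ (i >> 1)


def total_hamming(messages, mapping) -> int:
    """Sum of hypercube distances over all (source, destination) pairs."""
    return sum(hamming(mapping[src], mapping[dst]) for src, dst in messages)


N = 16
MESSAGES = [(i, i + 1) for i in range(N - 1)]  # hypothetical static pattern
BINARY_MAP = {i: i for i in range(N)}          # grid point i -> processor i
GRAY_MAP = {i: gray(i) for i in range(N)}      # grid point i -> processor gray(i)

if __name__ == "__main__":
    print("binary mapping:", total_hamming(MESSAGES, BINARY_MAP))
    print("gray mapping:  ", total_hamming(MESSAGES, GRAY_MAP))
```

Under the Gray code mapping every neighbor is exactly one wire away, so the total equals the number of messages.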
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• Sublinear
means that the formula on the CM is:
    C1 + VPR1(C2 + T1B)
       + VPR2(C3 + T2B + T3B)
       + VPR3(C4 + T4B)

where VPR = VPR1 + VPR2 + VPR3, and
T2B and T3B can be overlapped.

In general (for high VP ratios):
    VPR2 >> VPR1 and VPR2 >> VPR3
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• Hardware floating point
• Transposing and operating can be overlapped.
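
A minimal sketch of the three-phase formula follows. The slide only says that T2B and T3B "can be overlapped"; modelling perfect overlap as max(T2, T3) * B is an assumption made here for illustration, and the constants in the example run are invented:

```python
def sublinear_time(vpr1, vpr2, vpr3, bits,
                   c1, c2, c3, c4, t1, t2, t3, t4, overlap=True):
    """Three-phase cost model: start-up VPs, steady state, wind-down VPs.

    With overlap=False the middle term is the literal C3 + T2*B + T3*B.
    With overlap=True the two per-bit costs are assumed to overlap
    perfectly, so only the larger one is paid.
    """
    if overlap:
        middle = c3 + max(t2, t3) * bits
    else:
        middle = c3 + t2 * bits + t3 * bits
    return (c1
            + vpr1 * (c2 + t1 * bits)
            + vpr2 * middle
            + vpr3 * (c4 + t4 * bits))


if __name__ == "__main__":
    # Hypothetical constants; VPR = 1 + 14 + 1 = 16.
    args = dict(vpr1=1, vpr2=14, vpr3=1, bits=32,
                c1=100, c2=10, c3=10, c4=10, t1=1, t2=1, t3=0.5, t4=1)
    print(sublinear_time(**args, overlap=False))
    print(sublinear_time(**args, overlap=True))
```

Because VPR2 dominates at high VP ratios, almost all of the benefit comes from the overlapped middle term.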
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Examples 


A op B -> A   32 bit   :conditional or :always

#cycles          MB        FB
32               B0->Td    free
32               A0->Ta    Td->Reg
32      +---->   B1->Td    Ta op Reg->Reg
32      |        A1->Tb    Reg->Tc or Ta->Tc
32      |        Tc->A0    Td->Reg
32      |        B2->Tc    Tb op Reg->Reg
32      |        A2->Ta    Reg->Td or Tb->Td
32      +-----   Td->A1    Tc->Reg
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Examples 


A op B -> C   32 bit   :always

#cycles          MB        FB
32               B0->Td    free
32               A0->Ta    Td->Reg
32      +---->   B1->Td    Ta op Reg->Reg
32      |        A1->Tb    Reg->Tc or Ta->Tc
32      |        Tc->C0    Td->Reg
32      |        B2->Tc    Tb op Reg->Reg
32      |        A2->Ta    Reg->Td or Tb->Td
32      +-----   Td->C1    Tc->Reg
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Examples 


• Compound (with constant) operation
A op Const -> B   32 bit   :always

#cycles          MB               FB
2                Const->Bypass    free
1                free             Bypass->Reg[31]
32               A0->Ta           free
32               free             Ta op Reg[31]->Reg
32      +---->   A1->Tb           Reg->Tc or Ta->Tc
1       |        free             Bypass->Reg[31]
32      |        Tc->B0           Tb op Reg[31]->Reg
32      |        A2->Ta           Reg->Td or Tb->Td
1       |        free             Bypass->Reg[31]
                                  Ta op Reg[31]->Reg
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Examples 


Third order polynomial evaluation for all of the instructions. 


Poly(A) -> A
32 bit :conditional or :always


Comment: Constants for the various functions have already been
loaded into the Weitek chip. They are in temp registers.
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Examples 


#cycles          MB        FB
32               A0->Ta    free
32               A1->Tb    (Ta * C3) + C2->Reg
32      +---->   free      (Ta * Reg) + C1->Reg
32      |        free      (Ta * Reg) + C0->Reg
32      |        free      Reg->Tc
32      |        Tc->A0    (Tb * C3) + C2->Reg
32      |        A2->Ta    (Tb * Reg) + C1->Reg
32      |        free      (Tb * Reg) + C0->Reg
32      |        free      Reg->Td
32      +-----   Td->A1    (Ta * C3) + C2->Reg
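
The FB column in this schedule is Horner's rule: three multiply-adds per value, with the running result kept in Reg. A minimal sketch of the same sequence for a single value (the function name and the example coefficients are illustrative):

```python
def poly3(x: float, c0: float, c1: float, c2: float, c3: float) -> float:
    """Evaluate c3*x^3 + c2*x^2 + c1*x + c0 with three multiply-adds."""
    reg = x * c3 + c2    # (Ta * C3)  + C2 -> Reg
    reg = x * reg + c1   # (Ta * Reg) + C1 -> Reg
    reg = x * reg + c0   # (Ta * Reg) + C0 -> Reg
    return reg


if __name__ == "__main__":
    # 4*x^3 + 3*x^2 + 2*x + 1 at x = 2
    print(poly3(2, c0=1, c1=2, c2=3, c3=4))
```

The memory bus is free for most of those steps, which is why the schedule can load the next operand while the current one is being evaluated.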
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• 3 parts
  • on processor
  • on chip
  • off chip
• Two measures of NEWS cost:
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News Performance 


total-vp-ratio * (c0*length + c1)
+ (total-vp-ratio/axis-vp-ratio) * (c2*length + c3)
+ ((244)/(240n-chip-bits) : total-vp-ratio/axis-vp-ratio) 


* length * c4
+ c5


Where the c terms are various overhead costs.
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Although scans scale up sublinearly with VP ratio, it is important to
remember that they run in logarithmic time in the number of
physical processors in the machine.


As one researcher at TMC put it:
"Try to think of scans when you can't figure out any other
way to parallelize code".

Ex: Solving a recurrence relation.

Solve for x(n) given the formula:

    x(i) = z(i) * (y(i) - x(i-1))

    x(0) = c
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The solution can be derived by writing out several terms and
examining the pattern:

    x1 = z1 y1 - z1 x0

    x2 = z2 y2 - z2 z1 y1 + z2 z1 x0

    x3 = z3 y3 - z3 z2 y2 + z3 z2 z1 y1 - z3 z2 z1 x0

    x4 = z4 y4 - z4 z3 y3 + z4 z3 z2 y2 - z4 z3 z2 z1 y1 + z4 z3 z2 z1 x0

Noticing the pattern of z terms, it becomes clear that you can
solve for x(n) with a couple of elemental operations and a couple of
scans.
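
One way to realize "a couple of elemental operations and a couple of scans" is sketched below. It rewrites the recurrence as x(i) = a(i) x(i-1) + b(i) with a(i) = -z(i) and b(i) = z(i) y(i), takes a multiply-scan p(i) of the a terms, and then an add-scan of b(i)/p(i). The use of a multiply-scan and the division (which requires every z(i) to be nonzero) are assumptions of this sketch, not something the slides spell out. Exact fractions are used so the two versions can be compared for equality.

```python
from fractions import Fraction
from itertools import accumulate
from operator import mul


def solve_sequential(z, y, x0):
    """Reference: evaluate x(i) = z(i) * (y(i) - x(i-1)) one step at a time."""
    xs, x = [], x0
    for zi, yi in zip(z, y):
        x = zi * (yi - x)
        xs.append(x)
    return xs


def solve_with_scans(z, y, x0):
    """Same result using elementwise operations and two inclusive scans."""
    a = [-zi for zi in z]                        # elementwise
    b = [zi * yi for zi, yi in zip(z, y)]        # elementwise
    p = list(accumulate(a, mul))                 # multiply-scan: a(1)*...*a(i)
    terms = [bi / pi for bi, pi in zip(b, p)]    # elementwise, needs z(i) != 0
    s = list(accumulate(terms))                  # add-scan
    return [pi * (x0 + si) for pi, si in zip(p, s)]


if __name__ == "__main__":
    z = [Fraction(2), Fraction(3), Fraction(-1), Fraction(1, 2)]
    y = [Fraction(1), Fraction(4), Fraction(2), Fraction(5)]
    print(solve_sequential(z, y, Fraction(7)))
    print(solve_with_scans(z, y, Fraction(7)))
```

In floating point the running product can overflow or underflow for long sequences, so a production version would need some form of rescaling.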
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Field Aliasing 


• The ability to look at memory in a different way (to slice up VPs
  differently)


• Saving memory on temps
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• Applicable to all languages
  • explicit: can add as a library call
  • implicit: can have the compiler emit this
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• Increasing performance
Remembering the formula for most PARIS operations with
linear speedups, C1 + VPR(C2 + TB), we can examine the
cost of doing a single bit logical op (always) at various VP
ratios.
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Problem: Do a 1 bit logical operation at a VP ratio of 256.


At a VP ratio of 256,

    t = C1 + 256(C2 + 1T) = C1 + 256C2 + 256T


Using field aliases you can opt to view the field as a 256 bit
field at a VP ratio of 1,

    t = C1 + 1(C2 + 256T) = C1 + C2 + 256T


This saves 255 C2, so where it is possible, aliasing here is a big win!
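
The reason aliasing is legal for a logical operation is that each result bit depends only on the matching input bits, so 256 one-bit fields and one 256-bit field give the same answer. A minimal sketch of that equivalence, using a Python integer as a stand-in for the aliased 256-bit field (the helper names are hypothetical):

```python
def pack(bits):
    """View a list of one-bit fields as a single wide field."""
    value = 0
    for i, bit in enumerate(bits):
        value |= int(bit) << i
    return value


def unpack(value, count):
    """View a wide field as `count` one-bit fields again."""
    return [bool((value >> i) & 1) for i in range(count)]


def per_vp_and(a, b):
    """One 1-bit operation per virtual processor (VP ratio = len(a))."""
    return [x and y for x, y in zip(a, b)]


def aliased_and(a, b):
    """One wide operation on the aliased field (VP ratio 1)."""
    return unpack(pack(a) & pack(b), len(a))


if __name__ == "__main__":
    a = [i % 3 == 0 for i in range(256)]
    b = [i % 2 == 0 for i in range(256)]
    assert per_vp_and(a, b) == aliased_and(a, b)
    print("results match for", len(a), "one-bit fields")
```

The per-bit work is identical in both views; only the number of times the VP loop overhead is paid changes.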
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• Can work for any logical operation
• Can work for things like unsigned add, by allowing extra overflow bits
• Caveats
  • Only works for always operations
  • Could substantially slow things down at low VP ratios (20%)
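
One reading of "allowing extra overflow bits" is to leave a guard bit above each field so that a carry out of one field cannot spill into its neighbor when the whole aliased field is added at once. The sketch below illustrates that idea with Python integers; the lane layout and helper names are assumptions made for the example, not the PARIS field format.

```python
def pack_fields(values, width, guard=1):
    """Pack unsigned `width`-bit values into lanes of width + guard bits."""
    lane = width + guard
    packed = 0
    for i, value in enumerate(values):
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        packed |= value << (i * lane)
    return packed


def unpack_fields(packed, count, width, guard=1):
    """Read back `count` lanes, including their guard bits."""
    lane = width + guard
    mask = (1 << lane) - 1
    return [(packed >> (i * lane)) & mask for i in range(count)]


def aliased_add(a, b, width):
    """Add all fields with one wide addition; one guard bit absorbs the carry."""
    total = pack_fields(a, width) + pack_fields(b, width)
    return unpack_fields(total, len(a), width)


if __name__ == "__main__":
    a = [3, 250, 17, 255]
    b = [5, 10, 200, 255]
    print(aliased_add(a, b, width=8))
```

One guard bit is enough for a single addition of two fields; accumulating more terms would need more guard bits.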
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Taking out VP looping 


• Many of the same advantages as field aliasing
• Possibility of keeping things in registers
  • on sequencer
  • in CM chip
  • in Sprint chip
  • in floating point chip
• Compilers can often do this
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Performance 


• Other performance enhancers
  • Use always instructions if possible
  • Try to do conditionalization by weights
  • Be careful of page faults
  • Use high level primitives
    • stencils
    • math library
  • Perhaps investigate CMIS
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Conclusion 


Most CM applications only get a fraction of the achievable
FLOP/MIP rate. At TMC we are working on improving the
languages and compilers to get closer to peak. In the interim there
are many ways to get more performance (some much
easier than others). By using some of these means (with or without
help from TMC), it is often possible to get substantial code
speedups.


Thinking Machines Corporation


Acknowledgements


Creon Levit (NASA) creon@orville.nas.nasa.gov 
Dan Aronson (TMC) dan@think.com 

Guy Blelloch (CMU, TMC) guyb@sam.cs.cmu.edu 
Mark Bromley (TMC) bromley@think.com 

Denny Dahl (TMC) denny@think.com 

Brewster Kahle (TMC) kahle@think.com 

JP Massar (TMC) massar@think.com 

Bernie Murray (TMC) bernie@think.com 

Alex Vasilevsky (TMC) alex@think.com 




Connection Machine Architecture 


Dan Aronson 


Thinking Machines Corporation 




[Figure: Connection Machine System. Labels: 16K-processor sections (2 Gb, 64 bit FP), 25 Mbytes/sec links, VAX 6XXX, graphic displays.]


OUTLINE
• Connection Machine system architecture
• Front end
• CPU
  • Bit Serial Processors
  • Grid Communication and Scans
  • Router (combining, backwards, indirect addressing)
  • Indirect addressing hardware
  • Floating Point Unit
• Frame buffer
• DataVault
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Front End Architecture 


[Figure: front end. Labels: serial computer, cache and memory, memory bus, bus adaptor (VME, BI, LBus).]

• User programs run on front end
• Macrocode calls are made to the sequencer
• Front end and CM sequencer are closely coupled
• Performance for array transfer:
  • VAX 8800 <-> sequencer: 1 MByte per sec
  • LISPM <-> sequencer: < 3 MBytes per sec
  • SUN 4 <-> sequencer: 4 MBytes per sec
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The Connection Machine System 


[Figure: the Connection Machine system. Front ends (ULTRIX; LISPM, < 3 MB/sec; SUN 4 UNIX, 4 MB/sec) connect through a switch to 16K-processor sections (processor/memory/communication). I/O labels: frame buffer; DataVault disk farm, 25-40 MB/sec; VME* I/O, 20 MB/sec; other high-speed interfaces.]

* not completed
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16K Processor CPU 


[Figure: 16K processor CPU. Labels: I/O bus, instruction bus.]

Switch (Nexus)
• Connects front ends to 16K processor blocks
• Full cross bar
• Broadcast instructions

Sequencer
• 8 MHz bit slice controller
• 128 bits x 64K micro control store
• Synchronous with CM
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Processor Section 


[Figure: processor section. Labels: 32 sets; RAM, 64K x 1 bits; 8 data chips; 3 ECC chips; CM chip (16 bit-serial processors and router), shown twice; router wires; enable; floating point unit; status; sequencer address.]

Section is repeated 2,048 times in a full CM-2
Instructions are broadcast
Memory addresses can be broadcast or locally generated (indirect addressing)
The memory and floating point units are commercial parts
CM and Sprint chips are standard cell/custom CMOS parts
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Sprint Chip 


[Figure: Sprint chip. Labels: memory bus (32), float bus (32), address bus, FPU status bus, offset (20), transposers including T2, status transposer.]

• Bit serial <-> word parallel conversion
• Floating point interface
• Indirect addressing
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Floating Point Unit 


[Figure: floating point unit attached to the Sprint chip. Labels: instruction from sequencer; register file, 32 or 64-bit x 32; multiply.]

• Standard IEEE part
• Batching 32 processors gets vector performance
• Virtual processor pipelining
• Floating point add or multiply takes one "Add time"
• Floating point divide takes 3 "Add times"
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Transposer Diagram 


[Figure: transposer. Labels: data in, two address registers, data out.]

• bit serial -> word parallel conversion
• word parallel -> bit serial conversion
• 32-cycle latency
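
The conversion the transposer performs can be pictured as transposing a 32 x 32 bit matrix: 32 memory reads each deliver one bit from each of 32 processors, and after the transpose each row holds one processor's complete 32-bit word. A minimal software sketch of that idea (this models the data movement only, not the hardware):

```python
def transpose_bits(rows, width=32):
    """Transpose a bit matrix stored as a list of integers.

    Bit j of rows[i] becomes bit i of result[j]. For the result to be
    transposed back to the input, len(rows) must equal `width`.
    """
    result = [0] * width
    for i, row in enumerate(rows):
        for j in range(width):
            if (row >> j) & 1:
                result[j] |= 1 << i
    return result


if __name__ == "__main__":
    # One 32-bit word per processor (arbitrary test pattern).
    words = [(i * 2654435761) & 0xFFFFFFFF for i in range(32)]
    slices = transpose_bits(words)       # slices[j]: bit j of every word
    assert transpose_bits(slices) == words
    print("round trip ok")
```

The 32-cycle latency quoted above matches this picture: a full set of 32 slices has to be collected before the first word is complete.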
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Connection Machine Chip 


[Figure: Connection Machine chip. Labels: instruction, readback pins, global-or, 16+6 ECC.]
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DataVault Hardware 


• Parallel File System
  • 32 + 8 ECC + 3 spare drives
• High Data Transfer Rate
  • 25 - 40 MBytes per second
• Data Protection
  • Full ECC
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UNIX - LIKE HIERARCHICAL FILE SYSTEM 
EACH DEVICE IS INDEPENDENT FILE SYSTEM 
EACH DEVICE HAS A FILE SERVER 


FRONT END AND ALL FILE SERVERS ARE LINKED VIA 
LOCAL ETHERNET FOR COMMAND AND CONTROL 


STANDARD UNIX DEVICE/ FILE SYSTEM CALLS 


[Figure: DataVault drive array. Labels: 32-bit data words, 7 bits of ECC code, 3 spares.]
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DataVault Mass Storage 


• > 150 MBytes per second aggregate transfer rate
• Up to 64 DataVaults (2.5 TBytes) per system
• Full redundancy and Data Healing
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Color Graphic Display 


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Frame Buffer Hardware 


• Pixel resolution
  • high: 1280 x 1024, 60 Hz non-interlaced
  • NTSC (television): 640 x 480, 30 Hz interlaced

• Color resolution
  • "24 bit" direct color (red, green, blue: 8 bits each)
  • "8 bit" pseudo-color (uses color lookup table)
  • overlay (4 bits)

• High speed
  • plugs into Connection Machine backplane
  • time to write one full image on a 16K CM-2
    • NTSC 8 bit: 15 to 50 ms
    • high 24 bit: 165 to 270 ms

• Panning and zooming
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SCANS 


• Use all hypercube wires directly
• Very fast operation because of the regularity
• Performance: 25 "Add times"
• Faster with high virtual processor ratio
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SCANS: An Example

[Figure: a row of data with the results of Max-Scan (inclusive), Add-Scan (inclusive) and Add-Scan (exclusive).]
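
The three scan variants named on this slide can be sketched as follows. The input row is made up, since the values from the original figure did not survive. The last function shows one well-known way to compute an add-scan in a logarithmic number of parallel steps; it is included to illustrate the idea and is not claimed to be the CM's implementation.

```python
from itertools import accumulate
from operator import add


def inclusive_scan(values, op):
    """result[i] combines values[0..i]."""
    return list(accumulate(values, op))


def exclusive_scan(values, op, identity):
    """result[i] combines values[0..i-1]; result[0] is the identity."""
    return [identity] + inclusive_scan(values, op)[:-1]


def log_step_add_scan(values):
    """Inclusive add-scan in ceil(log2(n)) data-parallel steps."""
    out = list(values)
    step = 1
    while step < len(out):
        # Every element adds in the element `step` places to its left.
        out = [out[i] + (out[i - step] if i >= step else 0)
               for i in range(len(out))]
        step *= 2
    return out


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical input row
    print("data              ", data)
    print("max-scan inclusive", inclusive_scan(data, max))
    print("add-scan inclusive", inclusive_scan(data, add))
    print("add-scan exclusive", exclusive_scan(data, add, 0))
```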
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In-Router Combining 


• Improves performance for clustering sends,
  e.g., histogramming

• Run time fan-in tree construction

• Detects messages with the same destination on
  each router and combines them

• Operations
  • Integer Add
  • Integer Max
  • Overwrite
  • Logical Or
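
The effect of a combining send, as seen by the program, can be modelled sequentially: messages that share a destination are merged with the chosen operation. The sketch below shows the histogramming case mentioned above. It models only the result, not the fan-in tree the router builds, and the function name is hypothetical.

```python
from operator import add


def send_with_combine(destinations, values, size, combine, identity):
    """Deliver values[i] to destinations[i], merging collisions with `combine`."""
    memory = [identity] * size
    for dest, value in zip(destinations, values):
        memory[dest] = combine(memory[dest], value)
    return memory


if __name__ == "__main__":
    # Histogram: every processor sends a 1 to the bucket named by its key.
    keys = [2, 0, 2, 1, 2, 0]
    print(send_with_combine(keys, [1] * len(keys), 3, add, 0))

    # Integer max: each destination keeps the largest value sent to it.
    print(send_with_combine([0, 1, 0, 1], [5, 2, 7, 9], 2, max, 0))
```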
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Router 


• One router per CM chip, 4096 (2^12) total
• Each connected to 12 neighbors along hypercube

• Packetized, pipelined, load balancing, store and
  forward algorithm

• In router combining
• Store router state
• Backwards routing
• Sprint routing
• Performance
  • 1.5 milliseconds per VP for 32-bit random send
  • ~1 GBit per second throughput
  • 75 "Add times"
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Backward Routing 
(Hardware "Get")


• Eliminates hot-spots on "shared-memory" accesses
• Router can store its switch state in memory, including
  combining information (~400 bits per virtual processor)
• Router can run "backwards"
• Run time fan-out tree construction

Routing time is 2 times forward routing
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Sprint Chip Routing 


Two tier routing 
• hypercube routing
• indirect addressing


Indirect addressing for message delivery 


Linear routing time with virtual processors 


Part of Release 5.0 
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NEWS 


1, 2, 3, 4, ... dimensions supported


Grid communications over the hypercube wires


Faster than the router
• no address part of the message
• local messages get delivered directly


Performance: 6 "Add times" 


Faster with high virtual processor ratio 
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